BUG-22796 Concat multicolumn tz-aware DataFrame #23036
Conversation
Hello @tonytao2012! Thanks for updating the PR.
Comment last updated on October 09, 2018 at 12:08 UTC
@@ -53,6 +53,21 @@ def test_concat_multiple_tzs(self):
        expected = DataFrame(dict(time=[ts2, ts3]))
        assert_frame_equal(results, expected)

    def test_concat_tz_NaT(self):
Could you also add a test where ts1 is NaT?
Good catch, it results in an object column instead of the correct dtype. I'm not entirely sure how to solve this issue, as it appears to happen when the dataframe is created before the concat even occurs.
ts1 = pd.Timestamp(pd.NaT, tz='UTC')
df1 = pd.DataFrame([[ts1, ts2]])

df1[0]
Out[37]:
0   NaT
Name: 0, dtype: datetime64[ns]
Any ideas?
That behavior is theoretically correct if NaT was replaced by a naive Timestamp. NaT is an edge case where it can exist with a datetime64[ns, tz] dtype.
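As a quick illustration of that edge case (a minimal sketch, not code from this PR; the dtype string is the standard pandas spelling):

```python
import pandas as pd

# NaT stored under an explicitly tz-aware dtype: the missing value
# itself is tz-naive, but the container's dtype stays datetime64[ns, UTC].
s = pd.Series([pd.NaT], dtype="datetime64[ns, UTC]")
print(s.dtype)     # datetime64[ns, UTC]
print(s[0] is pd.NaT)  # True
```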
Don't worry about solving that in this PR. Feel free to open up another issue about that case.
Sounds good, should I still add the new test?
You can add the test with ts1 as NaT and xfail it, but link it to a new issue.
The bug you're describing may be #12499
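A sketch of the suggested pattern, assuming pytest's `pytest.param`/`xfail` API (the `reason` text pointing at GH#23037 is illustrative, taken from the issue linked later in this thread):

```python
import pandas as pd
import pytest

# Parametrize over timestamps; the NaT case is expected to fail until
# the linked issue is resolved, so it carries an xfail marker.
TS_CASES = [
    pd.Timestamp("2015-01-01", tz="UTC"),
    pytest.param(pd.NaT, marks=pytest.mark.xfail(reason="GH#23037")),
]
```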
Codecov Report
@@ Coverage Diff @@
## master #23036 +/- ##
==========================================
- Coverage 92.19% 92.19% -0.01%
==========================================
Files 169 169
Lines 50904 50911 +7
==========================================
+ Hits 46933 46939 +6
- Misses 3971 3972 +1
Continue to review full report at Codecov.
Added the xfail test and linked to issue #23037. Thanks for the feedback.
@@ -53,6 +55,24 @@ def test_concat_multiple_tzs(self):
        expected = DataFrame(dict(time=[ts2, ts3]))
        assert_frame_equal(results, expected)

    @pytest.mark.parametrize('t1', ['2015-01-01',
                             pytest.param('pd.NaT', marks=pytest.mark.xfail(
Why is pd.NaT in quotes? Shouldn't it just be pd.NaT?
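The distinction matters because the string 'pd.NaT' is just text, not the sentinel. A small sketch (not the PR's test code) of what each form would do inside the test:

```python
import pandas as pd

# Timestamp cannot parse the literal string "pd.NaT"; it raises ValueError.
try:
    pd.Timestamp("pd.NaT")
    string_parsed = True
except ValueError:
    string_parsed = False
print(string_parsed)  # False

# The real sentinel passes through as NaT.
print(pd.Timestamp(pd.NaT, tz="UTC") is pd.NaT)  # True
```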
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -883,6 +883,7 @@ Reshaping
- Bug in :func:`pandas.wide_to_long` when a string is passed to the stubnames argument and a column name is a substring of that stubname (:issue:`22468`)
- Bug in :func:`merge` when merging ``datetime64[ns, tz]`` data that contained a DST transition (:issue:`18885`)
- Bug in :func:`merge_asof` when merging on float values within defined tolerance (:issue:`22981`)
- Bug in :func:`pandas.concat` when merging multicolumn DataFrames with tz-aware data (:issue:`22796`)
Can you be more specific here? This isn't an issue with all multi-column DataFrames with tz-aware data. What are the exact conditions needed?
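For reference, a minimal sketch of the triggering condition (mismatched column sets between tz-aware frames, modeled on the session later in this thread; the column names and dates are illustrative):

```python
import pandas as pd

# Both frames share tz-aware column "A"; only `b` has "B", so concat
# must synthesize missing "B" values for the row coming from `a`.
a = pd.DataFrame({"A": pd.to_datetime(["2015-01-01"]).tz_localize("UTC")})
b = pd.DataFrame({
    "A": pd.to_datetime(["2015-01-01", "2015-01-02"]).tz_localize("UTC"),
    "B": pd.to_datetime(["2015-01-01", "2015-01-02"]).tz_localize("UTC"),
})
out = pd.concat([a, b], sort=True)
print(out["B"].isna().sum())  # 1 — the row from `a` is filled with a missing value
```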
pandas/core/internals/concat.py
Outdated
@@ -186,6 +188,11 @@ def get_reindexed_values(self, empty_dtype, upcasted_na):

        if getattr(self.block, 'is_datetimetz', False) or \
                is_datetimetz(empty_dtype):
            if self.block is None:
                missing_arr = np.full(self.shape, fill_value)
                missing_time = DatetimeIndex(missing_arr[0],
Can you walk back up the stack and see exactly where block is None? IIRC this function guarantees that block is not None, so this might be an error higher up.
Block is None when there is a mismatched number of columns between the two DataFrames being concatenated. I assumed this was correct behavior?
When I last changed this, I was really trying NOT to use DTI directly here, as we are trying to isolate things like that to higher-level methods.
See if you can revise; otherwise I will take a look.
Hmm, in that case, I'm not sure what to do. Clearly, block is supposed to be None in cases where the columns are mismatched:
for blkno, placements in libinternals.get_blkno_placements(blknos,
mgr.nblocks,
group=False):
assert placements.is_slice_like
join_unit_indexers = indexers.copy()
shape = list(mgr_shape)
shape[0] = len(placements)
shape = tuple(shape)
if blkno == -1:
unit = JoinUnit(None, shape)
When block is None, we have to create and return some array in get_reindexed_values, but np arrays can't have a tz dtype. I apologize if I'm missing something obvious. Feel free to take over the issue if you'd like, as I'm unsure of how to continue from here. I'll also keep thinking about it.
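To illustrate the constraint being described (a sketch, not the PR's code): NumPy has no tz-aware datetime dtype, so the timezone has to live in a pandas container such as DatetimeIndex.

```python
import numpy as np
import pandas as pd

# A filled NumPy array can hold NaT only as object or naive datetime64;
# there is no NumPy spelling for datetime64[ns, UTC].
missing = np.full((3,), np.datetime64("NaT", "ns"))
print(missing.dtype)  # datetime64[ns]

# Wrapping in DatetimeIndex lets the tz be attached at a higher level.
idx = pd.DatetimeIndex(missing).tz_localize("UTC")
print(idx.dtype)  # datetime64[ns, UTC]
```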
Just to make sure, on this PR I get:

In [1]: import pandas as pd

In [2]: a = pd.DataFrame({"A": pd.to_datetime([1, 2]).tz_localize("UTC")})

In [3]: b = pd.DataFrame({"A": pd.to_datetime([1, 2]).tz_localize("UTC"),
   ...:                   "B": pd.to_datetime([1, 2]).tz_localize("UTC")})

In [4]: pd.concat([a, b], sort=True).B.values
Out[4]:
array(['NaT', '1970-01-01T00:00:00.000000001',
       '1970-01-01T00:00:00.000000002'], dtype='datetime64[ns]')

In [5]: pd.concat([a, b], sort=True)
Out[5]:
                                     A                                    B
0  1970-01-01 00:00:00.000000001+00:00                                  NaT
1  1970-01-01 00:00:00.000000002+00:00  1970-01-01 00:00:00.000000001+00:00
0  1970-01-01 00:00:00.000000001+00:00  1970-01-01 00:00:00.000000002+00:00
1  1970-01-01 00:00:00.000000002+00:00

the xfail you added @jreback is for the fact that
hmm, that does look suspicious but that's not the reason for the xfail.
pushed an update, that was a just-introduced bug.
one more time to fix the lint issue.
thanks @tonytao2012, welcome to have you work on the issue that we are xfailing
git diff upstream/master -u -- "*.py" | flake8 --diff
NumPy arrays don't have the datetimetz dtype, so I just passed the DatetimeIndex through directly.
Side note: there's another small bug (I think) where np.nan or pd.NaT takes on the dtype of the column instead of the row when concatenating, but the column should instead have an object dtype.
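The side note's behavior can be seen with plain numeric columns too (a minimal sketch, not tied to this PR's tz case): the filled values adopt the column's upcast dtype rather than producing an object column.

```python
import pandas as pd

a = pd.DataFrame({"A": [1]})
b = pd.DataFrame({"A": [2], "B": [3]})
out = pd.concat([a, b], sort=True)

# The missing B value for the row from `a` is filled with NaN, and the
# whole column is upcast to float64 rather than becoming object dtype.
print(out["B"].dtype)  # float64
```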